Improving Low-Resource Statistical Machine Translation with a Novel Semantic Word Clustering Algorithm
نویسندگان
چکیده
In this paper we present a non-languagespecific strategy that uses large amounts of monolingual data to improve statistical machine translation (SMT) when only a small parallel training corpus is available. This strategy uses word classes derived from monolingual text data to improve the word alignment quality, which generally deteriorates significantly because of insufficient training. We present a novel semantic word clustering algorithm to generate the word classes motivated by the word similarity metric presented in (Lin, 1998). Our clustering results showed this novel word clustering outperforms a state-of-the-art hierarchical clustering. We then designed a new procedure for using the derived word classes to improve word alignment quality. Our experiments showed that the use of the word classes can recover over 90% of the loss resulting from the alignment quality that is lost due to the limited parallel training.
منابع مشابه
Improving word alignment for low resource languages using English monolingual SRL
We introduce a new statistical machine translation approach specifically geared to learning translation from low resource languages, that exploits monolingual English semantic parsing to bias inversion transduction grammar (ITG) induction. We show that in contrast to conventional statistical machine translation (SMT) training methods, which rely heavily on phrase memorization, our approach focu...
متن کاملBilingual Word Spectral Clustering for Statistical Machine Translation
In this paper, a variant of a spectral clustering algorithm is proposed for bilingual word clustering. The proposed algorithm generates the two sets of clusters for both languages efficiently with high semantic correlation within monolingual clusters, and high translation quality across the clusters between two languages. Each cluster level translation is considered as a bilingual concept, whic...
متن کاملArabic-English Semantic Word Class Alignment to Improve Statistical Machine Translation
Clustering words is a widely used technique in statistical natural language processing. It requires syntactic, semantic, and contextual features. Especially, semantic clustering is gaining a lot of interest. It consists in grouping a set of words expressing the same idea or sharing the same semantic properties. In this paper, we present a new method to integrate semantic classes in a Statistica...
متن کاملParaphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation
Out-of-vocabulary (OOV) word is a crucial problem in statistical machine translation (SMT) with low resources. OOV paraphrasing that augments the translation model for the OOV words by using the translation knowledge of their paraphrases has been proposed to address the OOV problem. In this paper, we propose using word embeddings and semantic lexicons for OOV paraphrasing. Experiments conducted...
متن کاملSemantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach
We describe a unified and coherent syntactic framework for supporting a semanticallyinformed syntactic approach to statistical machine translation. Semantically enriched syntactic tags assigned to the target-language training texts improved translation quality. The resulting system significantly outperformed a linguistically naive baseline model (Hiero), and reached the highest scores yet repor...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011